This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
I will ask my own questions about this data set. There are many variables in this data set, and I will explore between 10 and 15 of them in this analysis.
The key columns are dropped because they are not useful in the analysis process; these keys are: ListingKey, ListingNumber, LoanKey, MemberKey.
## ListingCreationDate CreditGrade
## 0 0
## Term LoanStatus
## 0 0
## ClosedDate BorrowerAPR
## 0 25
## BorrowerRate LenderYield
## 0 0
## EstimatedEffectiveYield EstimatedLoss
## 29084 29084
## EstimatedReturn ProsperRating..numeric.
## 29084 29084
## ProsperRating..Alpha. ProsperScore
## 0 29084
## ListingCategory..numeric. BorrowerState
## 0 0
## Occupation EmploymentStatus
## 0 0
## EmploymentStatusDuration IsBorrowerHomeowner
## 7625 0
## CurrentlyInGroup GroupKey
## 0 0
## DateCreditPulled CreditScoreRangeLower
## 0 591
## CreditScoreRangeUpper FirstRecordedCreditLine
## 591 0
## CurrentCreditLines OpenCreditLines
## 7604 7604
## TotalCreditLinespast7years OpenRevolvingAccounts
## 697 0
## OpenRevolvingMonthlyPayment InquiriesLast6Months
## 0 697
## TotalInquiries CurrentDelinquencies
## 1159 697
## AmountDelinquent DelinquenciesLast7Years
## 7622 990
## PublicRecordsLast10Years PublicRecordsLast12Months
## 697 7604
## RevolvingCreditBalance BankcardUtilization
## 7604 7604
## AvailableBankcardCredit TotalTrades
## 7544 7544
## TradesNeverDelinquent..percentage. TradesOpenedLast6Months
## 7544 7544
## DebtToIncomeRatio IncomeRange
## 8554 0
## IncomeVerifiable StatedMonthlyIncome
## 0 0
## TotalProsperLoans TotalProsperPaymentsBilled
## 91852 91852
## OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate
## 91852 91852
## ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed
## 91852 91852
## ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing
## 91852 95009
## LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber
## 0 96985
## LoanMonthsSinceOrigination LoanNumber
## 0 0
## LoanOriginalAmount LoanOriginationDate
## 0 0
## LoanOriginationQuarter MonthlyLoanPayment
## 0 0
## LP_CustomerPayments LP_CustomerPrincipalPayments
## 0 0
## LP_InterestandFees LP_ServiceFees
## 0 0
## LP_CollectionFees LP_GrossPrincipalLoss
## 0 0
## LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments
## 0 0
## PercentFunded Recommendations
## 0 0
## InvestmentFromFriendsCount InvestmentFromFriendsAmount
## 0 0
## Investors
## 0
As a result of the previous step, there are many columns with a very large number of NAs (more than 2000 instances), so I will drop these columns. The dropped columns are: EstimatedEffectiveYield, EstimatedLoss, EstimatedReturn, ProsperRating..Alpha., CurrentCreditLines, OpenCreditLines, AmountDelinquent, PublicRecordsLast12Months, RevolvingCreditBalance, BankcardUtilization, AvailableBankcardCredit, TotalTrades, TradesNeverDelinquent..percentage., TradesOpenedLast6Months, DebtToIncomeRatio, TotalProsperLoans, TotalProsperPaymentsBilled, OnTimeProsperPayments, ProsperPaymentsLessThanOneMonthLate, ProsperPaymentsOneMonthPlusLate, ProsperPrincipalBorrowed, ProsperPrincipalOutstanding, ScorexChangeAtTimeOfListing, CreditScoreRangeUpper, LoanFirstDefaultedCycleNumber.
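The column-dropping step described above could be sketched roughly as follows. This is only an illustration on a toy data frame, not the original code: the name `toy`, its values, and the threshold of 2 are assumptions standing in for the full loan data and the 2000-NA cutoff.

```r
# Toy stand-in for the loan data (hypothetical values, for illustration only)
toy <- data.frame(
  Term          = c(36, 36, 60, 36, 60),
  BorrowerAPR   = c(0.12, NA, 0.25, 0.15, 0.31),
  EstimatedLoss = c(NA, NA, NA, 0.05, 0.08)
)

# Count the missing values in every column
na_counts <- colSums(is.na(toy))

# Keep only the columns whose NA count does not exceed the threshold
# (2 here; the analysis uses "more than 2000" on the full data)
na_threshold <- 2
toy_reduced <- toy[, na_counts <= na_threshold, drop = FALSE]

names(toy_reduced)
```

A threshold on the NA count keeps the rule explicit, although a hand-picked list of columns (as used in this analysis) allows exceptions for columns that are still wanted.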
In the previous step I dropped the columns with a high number of NAs, but some cells still contain NAs. Now I will drop the rows that contain NAs and save the result in a new data frame called new_df.
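A minimal sketch of the row-dropping step, assuming base R's `complete.cases()` is an acceptable way to do it (the original code is not shown, and `toy` is a hypothetical stand-in for the reduced loan data):

```r
# Toy data with a few missing cells (hypothetical values)
toy <- data.frame(
  BorrowerRate = c(0.09, 0.10, NA, 0.13),
  ProsperScore = c(7, NA, 4, 10)
)

# Keep only the rows that have no missing value in any column
toy_complete <- toy[complete.cases(toy), ]

nrow(toy_complete)
```

`na.omit(toy)` would give the same rows; `complete.cases()` is used here because it makes the row filter explicit.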
After the previous cleaning steps, the new data frame contains 52 features and 84,834 observations to work with, as the structure output below shows. The new data set contains data from 2009 onward.
## 'data.frame': 84834 obs. of 52 variables:
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 111894 64760 85967 100310 72556 74019 97834 97834 54939 100485 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 60 36 36 36 36 60 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 4 4 4 4 4 4 4 4 4 8 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.12 0.125 0.246 0.154 0.31 ...
## $ BorrowerRate : num 0.092 0.0974 0.2085 0.1314 0.2712 ...
## $ LenderYield : num 0.082 0.0874 0.1985 0.1214 0.2612 ...
## $ ProsperRating..numeric. : int 6 6 3 5 2 4 7 7 4 5 ...
## $ ProsperScore : num 7 9 4 10 2 4 9 11 7 4 ...
## $ ListingCategory..numeric. : int 2 16 2 1 1 2 7 7 1 1 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 12 25 34 18 6 16 16 22 3 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 43 52 21 43 50 29 24 24 22 50 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 44 113 44 82 172 103 269 269 300 1 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 1 2 2 2 1 1 2 2 1 1 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 111883 64724 85857 100382 72500 73937 97888 97888 53800 100573 ...
## $ CreditScoreRangeLower : int 680 800 680 740 680 700 820 820 640 680 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 6617 2247 9498 497 8265 7685 5543 5543 4395 6853 ...
## $ TotalCreditLinespast7years : int 29 29 49 49 20 10 32 32 56 29 ...
## $ OpenRevolvingAccounts : int 13 7 6 13 6 5 12 12 4 8 ...
## $ OpenRevolvingMonthlyPayment : num 389 115 220 1410 214 101 219 219 25 290 ...
## $ InquiriesLast6Months : int 3 0 1 0 0 3 1 1 1 1 ...
## $ TotalInquiries : num 5 1 9 2 0 16 6 6 2 4 ...
## $ CurrentDelinquencies : int 0 4 0 0 0 0 0 0 1 0 ...
## $ DelinquenciesLast7Years : int 0 14 0 0 0 0 0 0 28 0 ...
## $ PublicRecordsLast10Years : int 1 0 0 0 0 1 0 0 1 0 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 5 4 3 3 4 4 4 4 6 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 6125 2875 9583 8333 2083 ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 3 ...
## $ LoanMonthsSinceOrigination : int 0 16 6 3 11 10 3 3 22 2 ...
## $ LoanNumber : int 134815 77296 102670 123257 88353 90051 121268 121268 65946 125045 ...
## $ LoanOriginalAmount : int 10000 10000 15000 15000 3000 10000 10000 10000 13500 4000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 1866 1535 1757 1821 1649 1666 1813 1813 1419 1829 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 8 32 24 33 16 16 33 33 15 8 ...
## $ MonthlyLoanPayment : num 319 321 564 342 123 ...
## $ LP_CustomerPayments : num 0 5143 2820 679 1227 ...
## $ LP_CustomerPrincipalPayments : num 0 4091 1563 352 604 ...
## $ LP_InterestandFees : num 0 1052 1257 327 622 ...
## $ LP_ServiceFees : num 0 -108 -60.3 -25.3 -22.9 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments: num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 1 158 20 1 1 1 1 1 19 1 ...
ListingCreationDate: The date the listing was created.
Term: The length of the loan expressed in months.
LoanStatus: The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.
BorrowerRate: The Borrower’s interest rate for this loan.
ProsperRating (numeric): The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009.
ListingCategory: The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7 - Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans.
EmploymentStatus: The employment status of the borrower at the time they posted the listing.
EmploymentStatusDuration: The length in months of the employment status at the time the listing was created.
CurrentlyInGroup: Specifies whether or not the Borrower was in a group at the time the listing was created.
LoanOriginalAmount: The origination amount of the loan.
LoanOriginationQuarter: The quarter in which the loan was originated.
ProsperScore: A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009.
IncomeRange: The income range of the borrower at the time the listing was created.
- For the LoanStatus feature, what is the count of each LoanStatus, and what is their distribution?
- What is the distribution of ProsperScore?
- What is the distribution of BorrowerRate?
- What is the distribution of ProsperRating?
- What is the distribution of loans over the days?
- What is the distribution of loans over the months?
- What is the distribution of ListingCategory?
- What is the distribution of EmploymentStatus?
- What is the distribution of EmploymentStatusDuration?
- What is the distribution of IsBorrowerHomeowner?
- What is the distribution of CurrentlyInGroup?
- What is the distribution of LoanOriginalAmount?
- What is the distribution of LoanOriginationQuarter?
- What is the distribution of terms?
- What is the distribution of income range?
- What is the distribution of BorrowerState?
For the LoanStatus feature, what is the count of each LoanStatus, and what is their distribution?
## # A tibble: 11 x 2
## # Groups: LoanStatus [11]
## LoanStatus n
## <fct> <int>
## 1 Chargedoff 5334
## 2 Completed 19657
## 3 Current 56566
## 4 Defaulted 1005
## 5 FinalPaymentInProgress 205
## 6 Past Due (>120 days) 16
## 7 Past Due (1-15 days) 806
## 8 Past Due (16-30 days) 265
## 9 Past Due (31-60 days) 363
## 10 Past Due (61-90 days) 313
## 11 Past Due (91-120 days) 304
The loan status is the most important feature because it indicates whether a loan succeeded. In the status visualization, the status with the highest count is Current, followed by Completed. The categories are quite detailed, so I am going to group similar statuses together as follows: Defaulted (Chargedoff, Defaulted, Cancelled), Current (Current, FinalPaymentInProgress), Completed (the Completed status), and Past Due (all the remaining Past Due values).
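The grouping just described could be sketched as below. The helper name `group_status` is hypothetical and the original implementation is not shown; the counts are copied from the table above, so the group totals can be checked against the 84,834 rows of the cleaned data.

```r
# Map each detailed status to one of the four general groups
group_status <- function(status) {
  status <- as.character(status)
  ifelse(status %in% c("Chargedoff", "Defaulted", "Cancelled"), "Defaulted",
  ifelse(status %in% c("Current", "FinalPaymentInProgress"), "Current",
  ifelse(status == "Completed", "Completed", "Past Due")))
}

# Counts per detailed status, copied from the table above
counts <- c(
  "Chargedoff" = 5334, "Completed" = 19657, "Current" = 56566,
  "Defaulted" = 1005, "FinalPaymentInProgress" = 205,
  "Past Due (>120 days)" = 16, "Past Due (1-15 days)" = 806,
  "Past Due (16-30 days)" = 265, "Past Due (31-60 days)" = 363,
  "Past Due (61-90 days)" = 313, "Past Due (91-120 days)" = 304
)

# Sum the counts within each general group
group_totals <- tapply(counts, group_status(names(counts)), sum)
group_totals
```

On the data frame itself, the same helper could fill the new column, for example `new_df$LoanStatusGroup <- group_status(new_df$LoanStatus)`.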
The new groups are plotted in the next cell:
After creating the new groups, Current is still the largest group, followed by Completed and then Defaulted. In the bivariate exploration step we will plot the Completed and Defaulted groups against the other features.
The ProsperRating category with the highest count is C and the one with the lowest count is AA. Note that there is no N/A category because I already dropped those rows in the data cleaning step.
Now we will look at the distribution of loans over time (by day).
## Warning in strptime(xx, ff, tz = "GMT"): unable to identify current timezone 'W':
## please set environment variable 'TZ'
## [1] '0.7.7'
Employed borrowers have the highest count, probably because employed people have more stable income than the others and deal with banks more often.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.500 6.167 8.588 12.333 62.917
- Loan amounts clearly cluster at round numbers (the peaks in the plot), such as 4,000, 10,000, 15,000, and 20,000, and loans of 4,000 have the highest count.
# Reorder the states by loan count (most frequent first), then plot the counts.
new_df %>%
  mutate(BorrowerState = fct_infreq(BorrowerState)) %>%
  ggplot(aes(x = BorrowerState)) +
  geom_bar()
The state with the highest number of loans is CA, followed by NY, TX, and FL. On the other hand, the states with the lowest number of loans are SD, VT, AK, and WY.
I think the year of the loan and IncomeVerifiable are also of interest, so I will investigate them in the bivariate section.
Yes, I created several new variables:
1. In question 2.1, I grouped LoanStatus into more generic groups and named the result LoanStatusGroup.
2. I extracted the month from ListingCreationDate and plotted its distribution in question 2.5.
3. In question 2.7, I created a new column named ListingCategory that maps ListingCategory..numeric. to its name (0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, etc.).
4. In question 2.9, I created a new column named EmploymentStatusDurationYears by dividing EmploymentStatusDuration (months) by 12.
5. In question 2.14, I created a new column named ProsperRating that maps ProsperRating..numeric. to its name (0 - N/A, 1 - HR, 2 - E, etc.).
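Two of the derived columns listed above (the category name and the duration in years) could be built along these lines. This is a sketch on a toy data frame, not the original code: `toy` and `category_names` are hypothetical names, and only the first four category codes from the variable descriptions are included.

```r
# Code-to-name lookup taken from the variable descriptions (first few codes only)
category_names <- c(
  "0" = "Not Available",
  "1" = "Debt Consolidation",
  "2" = "Home Improvement",
  "3" = "Business"
)

# Toy stand-in for the cleaned data (hypothetical values)
toy <- data.frame(
  ListingCategory..numeric. = c(1, 2, 0, 3),
  EmploymentStatusDuration  = c(44, 113, 6, 300)
)

# Look up each numeric code by its character form in the named vector
toy$ListingCategory <- unname(
  category_names[as.character(toy$ListingCategory..numeric.)]
)

# Convert the duration from months to years
toy$EmploymentStatusDurationYears <- toy$EmploymentStatusDuration / 12

toy
```

A named vector works as a small lookup table; the same pattern would apply to mapping ProsperRating..numeric. to its letter rating.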
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?
The EmploymentStatusDuration plot is skewed right (not normally distributed), and ProsperScore is close to a normal distribution but without a clear peak. Yes, I performed some operations to tidy the data. Before starting the visualization and exploration, I dropped the unnecessary columns and the columns with a high number of NAs, then dropped the remaining rows that contained NAs. During the exploration step, I created several new columns from the existing ones, as explained in the previous question.
In this part, I will explore the correlation coefficients between some numeric variables and visualize the strong relationships (high correlation). For the categorical variables, I will split the data by category and explore each group. For the dates, I will explore how different variables change over time.
##
## Pearson's product-moment correlation
##
## data: EmploymentStatusDurationYears and LoanOriginalAmount
## t = 22.851, df = 84832, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07152509 0.08490118
## sample estimates:
## cor
## 0.07821665
##
## Pearson's product-moment correlation
##
## data: ProsperRating..numeric. and BorrowerRate
## t = -917.27, df = 84832, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9537178 -0.9524851
## sample estimates:
## cor
## -0.9531054
This is a strong negative relationship!
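As a quick consistency check on the output above, the t statistic of a Pearson correlation test follows from the correlation r and the degrees of freedom: t = r * sqrt(df) / sqrt(1 - r^2). The sketch below applies this to the reported values and then confirms the relation against `cor.test()` on simulated data. The helper name `t_from_r` and the simulated vectors are assumptions for illustration only.

```r
# t statistic implied by a Pearson correlation r with df degrees of freedom
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

# Values reported above for ProsperRating..numeric. and BorrowerRate
t_reported <- t_from_r(r = -0.9531054, df = 84832)
round(t_reported, 2)

# The same relation checked against cor.test() on simulated data
set.seed(1)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)
res <- cor.test(x, y)
t_check <- t_from_r(unname(res$estimate), unname(res$parameter))
all.equal(t_check, unname(res$statistic))
```

With almost 85,000 rows, even the weak correlation of about 0.08 between EmploymentStatusDurationYears and LoanOriginalAmount yields a tiny p-value, so the size of the coefficient is more informative here than its significance.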
##
## Pearson's product-moment correlation
##
## data: ProsperRating..numeric. and LoanOriginalAmount
## t = 138.17, df = 84832, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4231065 0.4340925
## sample estimates:
## cor
## 0.4286153
The correlation between ProsperRating..numeric. and LoanOriginalAmount is moderate and positive; see the following visualization.
##
## Pearson's product-moment correlation
##
## data: BorrowerRate and LoanOriginalAmount
## t = -132.27, df = 84832, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4190646 -0.4079072
## sample estimates:
## cor
## -0.4135014
The correlation between BorrowerRate and LoanOriginalAmount is moderate and negative; see the following visualization:
##
## Pearson's product-moment correlation
##
## data: ProsperScore and ProsperRating..numeric.
## t = 289.71, df = 84832, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7018241 0.7085893
## sample estimates:
## cor
## 0.7052228
As you can see, there is a strong positive correlation between ProsperScore and ProsperRating; see the following visualization.
As a result, the HR, E, and D ratings have a higher probability of defaulting, while the AA, A, and B ratings have a lower probability of defaulting.
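One way to quantify a statement like this is the share of defaulted loans within each rating. The sketch below uses a toy data frame with made-up rows, and it assumes the default share is computed over all loans of a rating, including Current ones; the analysis may have used a different denominator.

```r
# Toy data: four HR loans and four AA loans (hypothetical rows)
toy <- data.frame(
  ProsperRating   = c("HR", "HR", "HR", "HR", "AA", "AA", "AA", "AA"),
  LoanStatusGroup = c("Defaulted", "Defaulted", "Completed", "Current",
                      "Completed", "Completed", "Current", "Defaulted")
)

# Row-wise proportions: each rating's loans split across the status groups
status_share <- prop.table(
  table(toy$ProsperRating, toy$LoanStatusGroup),
  margin = 1
)

# Share of defaulted loans per rating
default_share <- status_share[, "Defaulted"]
default_share
```

Using `margin = 1` makes every row sum to 1, so ratings with very different loan counts can still be compared.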
## # A tibble: 7 x 2
## # Groups: IncomeRange [7]
## IncomeRange n
## <fct> <int>
## 1 $0 45
## 2 $1-24,999 4652
## 3 $100,000+ 15202
## 4 $25,000-49,999 24167
## 5 $50,000-74,999 25623
## 6 $75,000-99,999 14496
## 7 Not employed 649
The number of borrowers with $0 income or who are not employed is very small. Even so, it is not surprising that these two categories have a higher probability of defaulting. In general, lower income is associated with a higher probability of defaulting. We can also observe that higher income ranges have a high percentage of Current loans, which suggests that the share of higher-income borrowers has been growing recently; to check this, I will explore income range over time later in this analysis.
## Warning: position_stack requires non-overlapping x intervals
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now I will extract the year from ListingCreationDate to use it in the next visualizations.
## Warning: Expected 2 pieces. Additional pieces discarded in 84834 rows [1,
## 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
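The warning above resembles the one `tidyr::separate()` gives when a value splits into more pieces than requested, although the original call is not shown. As an alternative sketch that avoids splitting strings, the year can be taken by parsing the date part. The first timestamp is copied from the structure output earlier; the second is a hypothetical example.

```r
# Timestamps in the format used by ListingCreationDate
listing_dates <- c(
  "2005-11-09 20:44:28.847000000",
  "2013-09-12 08:15:00.000000000"
)

# Take the date part (first 10 characters), parse it, then format the year.
# as.character() also covers the case where the column is stored as a factor.
date_part <- substr(as.character(listing_dates), 1, 10)
listing_year <- as.integer(format(as.Date(date_part), "%Y"))

listing_year
```

Replacing `"%Y"` with `"%m"` would extract the month in the same way.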
The number of defaulted loans decreases in more recent years, while the Current group grows because those loans were created recently.
As I noted for the previous plot, the number of loans is small in the early years and increases over time. It is clear that up to 2012 the borrower rate is not normally distributed, and in 2013 and 2014 (first 3 months only) the distribution is skewed right.
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Computation failed in `stat_smooth()`:
## x has insufficient unique values to support 10 knots: reduce k.
I investigated the relationship between LoanStatusGroup (after grouping the statuses into 4 general groups: completed, defaulted, past due, current) and 9 other features: 5 numeric features (year, EmploymentStatusDurationYears, Term, ProsperScore, BorrowerRate) and 4 categorical features (IncomeRange, EmploymentStatus, ListingCategory, ProsperRating). I also explored how these features change over the years. As a result, the following features have a relationship with the loan status and can serve as good indicators of defaulting: Prosper score, employment status, income range, Prosper rating, and borrower rate.
As in 3.1.2, there is a strong negative relationship between ProsperRating and BorrowerRate, with a correlation coefficient of about -0.95. And as in 3.1.6, there is a strong positive relationship between ProsperScore and ProsperRating, with a correlation coefficient of about 0.705.
For the main features of interest, the strongest relationship is between loan status group and Prosper score, followed by income range. For the other features, the strongest relationship is between ProsperRating and BorrowerRate, with a correlation coefficient of about -0.95.
## Warning: Ignoring unknown parameters: binwidth, bins, pad
In the previous plot, I plotted Prosper rating by loan status group over the years. You can see that most of the defaulted plots are skewed right (more low ratings) and the completed plots are skewed left (more high ratings).
We got the same result as in 3.1.6, which concluded that Prosper rating and Prosper score have a strong positive relationship.
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
You can see that the default percent for 2013 and 2014 is very small. We also got the same result as in the previous plots for both Prosper score and Prosper rating: low Prosper scores or low Prosper ratings mean a high default percent, and high values mean a low default percent.
As a result of part 4.6, I found that loans with a high borrower rate (0.3) have a low Prosper rating, and as in 4.7, a low Prosper rating has a higher default percent. I also found that the number of defaulted loans decreases over the years (see 3.2.10 and 4.8). If we connect this observation with the borrower rate, we can see that borrower rates also decrease over the years, as in 3.2.13 and 3.2.16.
The combination of IncomeVerifiable and IncomeRange has a strong effect on the default percent. In addition to low income ranges having a high default percent, a False IncomeVerifiable status also increases the default percent.
The HR, E, and D ratings (the worst ratings) have a higher probability of defaulting, while the AA, A, and B ratings (the best ratings) have a lower probability of defaulting. So we can conclude that a lower Prosper rating means a higher probability of defaulting.
BorrowerRate is more strongly related to ProsperRating than to ProsperScore: notice that at high ProsperRating values (>4) the region is yellow, which means a lower borrower rate. Both borrower rate and Prosper rating are strongly related to the default ratio, so we can say that a higher borrower rate goes with a lower Prosper rating and, in turn, a higher default probability.
It is clear that the combination of IncomeVerifiable and IncomeRange has a strong effect on the default percent. In addition to low income ranges having a high default percent, a False IncomeVerifiable status also increases the default percent.
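A default percent for every combination of the two variables could be computed as in the sketch below. The data frame is a toy with made-up rows, and the logical column `Defaulted` is a hypothetical helper (for example, TRUE where LoanStatusGroup is "Defaulted"), not a column of the original data.

```r
# Toy data: two income ranges crossed with IncomeVerifiable (hypothetical rows)
toy <- data.frame(
  IncomeRange      = c("$1-24,999", "$1-24,999", "$1-24,999", "$1-24,999",
                       "$100,000+", "$100,000+", "$100,000+", "$100,000+"),
  IncomeVerifiable = c("False", "False", "True", "True",
                       "False", "False", "True", "True"),
  Defaulted        = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# The mean of a logical vector is the share of TRUE values, so this gives
# the default percent for every IncomeRange x IncomeVerifiable cell
default_pct <- 100 * tapply(
  toy$Defaulted,
  list(toy$IncomeRange, toy$IncomeVerifiable),
  mean
)

default_pct
```

The result is a small matrix with income ranges as rows and the IncomeVerifiable status as columns, which maps directly onto a heatmap or a grouped bar chart.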
The main struggle in this work was understanding and cleaning the data. There are many banking terms that I had not heard before, so I spent about 2 days just understanding the features and selecting the key ones to investigate. I think this is the most important step of any analysis.
I felt a sense of success when the relationships and interactions between features started to appear.
After finishing the analysis, I found several features that affect loan defaults; paying attention to them could help reduce the default percent. These features include IncomeRange, IncomeVerifiable, ProsperRating, ProsperScore, and BorrowerRate. More details about the most influential ones follow:
IncomeRange: It has a strong influence on the default loan status. Lower income ranges and the not-employed status have a higher probability of defaulting, while higher income ranges have a lower probability of defaulting.
ProsperRating: It has a strong influence on the default loan status. Lower Prosper ratings (HR, E, D) have a higher probability of defaulting, while higher Prosper ratings (A, AA, B) have a lower probability of defaulting.
BorrowerRate: It has a strong influence on the default loan status. Higher borrower rates go with a lower Prosper rating and a higher probability of defaulting, while lower borrower rates go with a higher Prosper rating and a lower probability of defaulting.
IncomeVerifiable: It has a strong influence on the default loan status. Loans in the True IncomeVerifiable group have a lower probability of defaulting, while loans in the False IncomeVerifiable group have a higher probability of defaulting.